30 research outputs found

    Example file format of training dataset used in machine learning.

    Each line describes one protein and holds the total binding affinity score for each peptide-MHC length combination, e.g. 304 combinations for 76 common MHC I alleles (MHC I binds peptides that are typically eight to eleven amino acid residues long, so 76 alleles × 4 peptide lengths = 304 combinations). A binding affinity score is an IEDB IC50 (nM) score < 5000. Each score is weighted by the length of the protein. The scores are the input variables (predictors). The last column is a 1 or 0 that indicates an expected ‘YES’ or ‘NO’ vaccine candidacy and is the target variable. This expectation is based on the subcellular location annotation associated with the protein in UniProtKB (secreted or membrane-associated = 1, internal location = 0).
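
A minimal Python sketch of how one line of such a file could be assembled. It rests on two assumptions that the caption does not confirm: that the "total" score is the sum of the IC50 values below the cutoff, and that "weighted by the length of the protein" means division by the protein length. The function name and the input layout are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

PEPTIDE_LENGTHS = (8, 9, 10, 11)   # MHC I peptide lengths named in the caption
IC50_CUTOFF_NM = 5000.0            # only scores below this value are counted


def feature_row(
    scores: Dict[Tuple[str, int], List[float]],
    alleles: Sequence[str],
    protein_length: int,
    label: int,
) -> List[float]:
    """Build one line: one value per allele/length combination, then the label.

    ASSUMPTION: "total" = sum of IC50 scores below the cutoff.
    ASSUMPTION: "weighted by length" = divided by protein length.
    """
    row = []
    for allele in alleles:
        for k in PEPTIDE_LENGTHS:
            binders = [s for s in scores.get((allele, k), []) if s < IC50_CUTOFF_NM]
            row.append(sum(binders) / protein_length)
    row.append(float(label))  # 1 = expected vaccine candidate, 0 = not
    return row
```

With 76 alleles the row has 76 × 4 = 304 predictor columns plus the target column.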

    Comparison of test genes not identified by gene finders.

    ++ Number of groups of test genes not found in which the test genes are located consecutively along the chromosome.
    The highest number of test genes in a consecutive group.

    Comparison of genomic start and end locations of gene predictions with 299 test genes.

    Abbreviations: gm = GeneMark_hmm, aug = AUGUSTUS, gl = GlimmerHMM.
    ** See Figure 5 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0050609#pone-0050609-g005) for an explanation of the classifications.
    ++ Number of predicted genes that predict part of an entire gene, such that there can be more than one prediction for the same test gene.
    Number of predictions that did not overlap the test genes in any way.

    Schematic representation of gene prediction evaluation at the exon level.

    Exons are represented by shaded rectangles. Introns are represented by the adjoining solid lines. Abbreviations: TP = true positive, FP = false positive, and FN = false negative.

    Number of BLASTX hits using DNA consensus sequences from AUGUSTUS and GlimmerHMM predictions.

    The figure shows the BLASTX hits obtained when the consensus of predicted sequences from AUGUSTUS and GlimmerHMM is used as the query in an attempt to find novel Toxoplasma gondii proteins. These consensus sequences were derived by aligning predicted DNA sequences based on overlapping genomic locations (see text for details).

    Example of rule-based approach applied to highest affinity peptide on each test protein.

    Proteins are listed in ascending order of their lowest IC50 (nM) binding affinity score. A threshold value, e.g. 1.5, is applied to the score to split the list into two classifications: below the threshold is ‘YES’ for vaccine candidacy and above is ‘NO’. The rule-based classification is compared with the expected classification to determine performance accuracy. The threshold value is derived by trial and error, with the aim of correctly classifying the greatest number of true positives and true negatives.
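
The rule described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the trial-and-error step is approximated by scanning a caller-supplied list of candidate thresholds, and a score exactly equal to the threshold is assumed to fall into ‘NO’, which the caption does not specify.

```python
from typing import List, Sequence


def classify(min_ic50: Sequence[float], threshold: float) -> List[int]:
    """Return 1 ('YES') for scores below the threshold, else 0 ('NO').

    ASSUMPTION: a score equal to the threshold is classified as 'NO'.
    """
    return [1 if score < threshold else 0 for score in min_ic50]


def accuracy(predicted: Sequence[int], expected: Sequence[int]) -> float:
    """Fraction of proteins whose rule-based class matches the expected class."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def best_threshold(
    min_ic50: Sequence[float],
    expected: Sequence[int],
    candidates: Sequence[float],
) -> float:
    """Pick the candidate threshold with the highest accuracy (first one wins ties)."""
    return max(candidates, key=lambda t: accuracy(classify(min_ic50, t), expected))
```

Maximising accuracy is the same as maximising the count of true positives plus true negatives, which is the stated goal of the trial-and-error search.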

    Plot of conservation scores computed for binding peptides along a protein (UniProtKB ID: P13664).

    Each circle represents the amino acid conservation score computed for one position of a sliding window. The window has length 9 and slides one residue at a time. The colour of the circle represents the binding affinities against 76 common MHC alleles computed at each window. A window (i.e. a peptide) can theoretically bind to all 76 alleles, so colours are plotted in a set order: no, low, intermediate, and high affinity. For example, a dark blue circle for low affinity indicates that there are no intermediate or high affinity peptides at the window; however, a green circle for high affinity gives no indication of other affinities at the same window. Mean conservation = 0.7805; median conservation = 0.7946. For protein P13664 (Major surface antigen p30), 54.6% of high, 56% of intermediate, and 55.9% of low binders have conservation scores below the mean. The study shows that vaccine candidates are significantly more likely than non-vaccine candidates to have either a greater number of less conserved peptides or a lower total conservation score.
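
The plotting order and the "below the mean" percentages can be illustrated with a short Python sketch. It assumes that the visible colour at a window is simply the last class drawn in the stated order, and that the mean is taken over all windows of the protein; both are readings of the caption rather than confirmed details, and the function names are hypothetical.

```python
from statistics import mean
from typing import Iterable, Sequence

# Order in which affinity classes are plotted; later classes overplot earlier ones.
PLOT_ORDER = ("no", "low", "intermediate", "high")


def displayed_class(classes_at_window: Iterable[str]) -> str:
    """Return the class whose colour stays visible at a window.

    ASSUMPTION: the visible colour is the last one drawn in PLOT_ORDER.
    """
    return max(classes_at_window, key=PLOT_ORDER.index)


def percent_below_mean(scores: Sequence[float], is_binder: Sequence[bool]) -> float:
    """Percentage of selected windows whose conservation score is below the mean.

    ASSUMPTION: the mean is computed over all windows, not only the selected ones.
    """
    overall_mean = mean(scores)
    selected = [s for s, keep in zip(scores, is_binder) if keep]
    return 100.0 * sum(s < overall_mean for s in selected) / len(selected)
```

This matches the caption's example: a window showing "low" has no intermediate or high class, while a window showing "high" may hide any of the other classes.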

    Example of online output from IEDB peptide-MHC class I binding predictor.

    The binding predictor conceptually slides a window of a user-defined length (eight to eleven amino acid residues) one residue at a time from the start of the protein sequence. An affinity score is predicted for the ability of each fixed-length subsequence (as defined by each position of the sliding window) to bind to a user-specified MHC I allele. Fig. 1 shows the output when a sequence (e.g. MARHAIFFALCVLGL…) is input into the program to predict whether it contains peptides of length 9 that bind to the MHC allele HLA-A*11:01. The IC50 (nM) affinity scores for subsequence ‘MARHAIFFA’ at positions 1 to 9 are highlighted.
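
The sliding window itself is easy to sketch in Python. This shows only how the fixed-length subsequences and their 1-based positions are enumerated; it does not call the IEDB predictor or compute any affinity score, and the function name is hypothetical. The example sequence is just the prefix shown in the caption, which is truncated in the source.

```python
from typing import List, Tuple


def sliding_peptides(sequence: str, length: int) -> List[Tuple[int, int, str]]:
    """Enumerate (start, end, peptide) with 1-based inclusive positions.

    The window moves one residue at a time from the start of the sequence.
    Returns an empty list if the sequence is shorter than the window.
    """
    return [
        (i + 1, i + length, sequence[i:i + length])
        for i in range(len(sequence) - length + 1)
    ]


# Only the visible prefix of the caption's example sequence.
windows = sliding_peptides("MARHAIFFALCVLGL", 9)
print(windows[0])
```

A sequence of n residues yields n - length + 1 windows, so the 15-residue prefix gives 7 peptides of length 9.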

    Sensitivity and specificity for random forest tests applied to peptide-MHC binding scores for vaccine classification of Benchmark dataset.

    Abbreviations: (R) = target variable (1 or 0) in the training data randomly changed for each protein; HE = hold-out dataset error (%), i.e. the error when predicting the held-out 30% of the training data; OE = overall error (%), i.e. the percentage of incorrect predictions; SN = sensitivity (%) = true positives/(true positives + false negatives); SP = specificity (%) = true negatives/(true negatives + false positives).
    a Cross-validation used a random sample of 70% of the training dataset to build the predictive model and the remaining 30% for testing. This was repeated 10 times and the predictions averaged (predictions for the same input data fluctuate unless a random seed is set initially).
    b Benchmark proteins come from published studies with known or expected T-cell responses (source species: T. gondii); 100% of the training data was used to build the predictive model.
    Note: Number of input variables used to build the predictive model = 304 (i.e. the number of allele-peptide length combinations derived from 76 common alleles).
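
The error measures and the 70/30 split defined above can be written out as a short Python sketch. It uses only the standard library and does not train a random forest; the function names are hypothetical, and the metrics function assumes that both classes occur in the expected labels.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def error_metrics(expected: Sequence[int], predicted: Sequence[int]) -> Dict[str, float]:
    """Return OE, SN and SP as percentages, following the caption's definitions.

    ASSUMPTION: at least one positive and one negative in `expected`.
    """
    pairs = list(zip(expected, predicted))
    tp = sum(e == 1 and p == 1 for e, p in pairs)
    tn = sum(e == 0 and p == 0 for e, p in pairs)
    fp = sum(e == 0 and p == 1 for e, p in pairs)
    fn = sum(e == 1 and p == 0 for e, p in pairs)
    return {
        "OE": 100.0 * (fp + fn) / len(pairs),  # overall error
        "SN": 100.0 * tp / (tp + fn),          # sensitivity
        "SP": 100.0 * tn / (tn + fp),          # specificity
    }


def holdout_split(
    n_items: int, train_fraction: float = 0.7, seed: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Randomly split item indices into training and hold-out index lists."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    indices = list(range(n_items))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_items))
    return indices[:cut], indices[cut:]
```

Calling `holdout_split` ten times with different seeds and averaging the resulting metrics would mirror the repeated procedure in footnote a; fixing the seed removes the run-to-run fluctuation the footnote mentions.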

    Number of matching predicted genes with 299 test genes using BLASTN (with 250, 500, and 1000 training genes).

    Abbreviations: gl = GlimmerHMM; aug = AUGUSTUS.
    N/A = not applicable; the AUGUSTUS training program does not give the option to control the number of bases that precede and follow the coding segment (CDS) sequence of the training genes.
    Number of predicted genes that align entirely or partly with the test genes and meet the criteria E-value = 0 and 100% coverage. A value in brackets is the number of predicted genes that are exactly the same as the test genes, i.e. each exon genomic coordinate is the same.
    ++ Number of predicted genes that align to the same test gene, i.e. the predicted gene is only a part of the entire test gene and there can be one or more predictions per test gene.
    The underlined values indicate the highest number of matches for each gene finder.